
HTML Parsing

HTML parsing is the process by which the browser reads raw HTML text (for example, the contents of an .html file) and converts it into a structured, in-memory representation called the DOM (Document Object Model).

1. HTML Source Code Arrives

When you open a webpage, the browser requests the HTML document from the server and receives it as text, for example:

<!DOCTYPE html>
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Hello</h1>
<p>Welcome to HTML parsing!</p>
</body>
</html>

At this point the document is only plain text; it has not yet been structured or rendered.

2. Tokenization

The browser’s HTML parser starts reading this text character by character.

It breaks the text into tokens, each representing an HTML construct:

  • The DOCTYPE (<!DOCTYPE html>)
  • Start tags (<html>, <body>), which carry any attributes
  • End tags (</body>)
  • Character data, which later becomes text nodes (Hello, Welcome...)
  • Comments

For the page above, the token stream begins as follows (whitespace between tags omitted):

<!DOCTYPE html>
<html>
<head>
<title>
My Page
</title>
</head>
<body>
<h1>
Hello
</h1>
...
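
To make the tokenization step concrete, here is a minimal sketch in Python using the standard library's html.parser module. It is not how a browser tokenizer is implemented; it only shows the same idea of turning text into a flat sequence of constructs. The names TokenCollector and tokenize, and the (kind, value) token shape, are assumptions made for this illustration.

```python
from html.parser import HTMLParser

SOURCE = """<!DOCTYPE html>
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Hello</h1>
<p>Welcome to HTML parsing!</p>
</body>
</html>"""


class TokenCollector(HTMLParser):
    """Hypothetical helper: records each recognized construct as (kind, value)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text inside the brackets.
        self.tokens.append(("doctype", decl))

    def handle_starttag(self, tag, attrs):
        # Attributes arrive together with the start tag; ignored in this sketch.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip the whitespace-only text that sits between tags.
        if data.strip():
            self.tokens.append(("text", data))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))


def tokenize(source):
    """Return the list of (kind, value) tokens found in an HTML string."""
    collector = TokenCollector()
    collector.feed(source)
    collector.close()
    return collector.tokens


if __name__ == "__main__":
    for kind, value in tokenize(SOURCE):
        print(kind, value)
```

Running the script prints one token per line, in document order. Note that the output is still flat: nothing here records that title is inside head. That nesting is established in the next step.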

3. DOM Tree Construction

As the tokens are recognized, the browser creates nodes and connects them hierarchically to build the DOM tree.

Document
└── html
    ├── head
    │   └── title: "My Page"
    └── body
        ├── h1: "Hello"
        └── p: "Welcome to HTML parsing!"

  • The DOM is a tree structure representing all elements and their relationships.
  • JavaScript reads and modifies this tree, and CSS selectors are matched against its elements.
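
The tree construction step can be sketched in the same way. The Python example below keeps a stack of currently open elements: a start tag creates a node under the element on top of the stack and pushes it, and an end tag pops back to the matching element. This is a simplified illustration, not the browser's algorithm: Node, TreeBuilder, build_tree, and outline are hypothetical names, the VOID_ELEMENTS set is deliberately incomplete, and the error recovery that real parsers perform on malformed markup is left out.

```python
from html.parser import HTMLParser

# Partial list of elements that never have children (assumption: enough for a demo).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A hypothetical, minimal stand-in for a DOM node."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.document = Node("Document")
        self.open_elements = [self.document]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[i].name == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        # Text becomes a child node of the innermost open element.
        if data.strip():
            self.open_elements[-1].children.append(Node("#text", data.strip()))


def build_tree(source):
    """Parse an HTML string and return the root Document node."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.document


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    label = node.name if node.text is None else f'{node.name}: "{node.text}"'
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    page = "<html><head><title>My Page</title></head><body><h1>Hello</h1></body></html>"
    print("\n".join(outline(build_tree(page))))
```

In this sketch, text is stored as separate #text child nodes, which is closer to how the DOM represents it than the title: "My Page" shorthand used in the diagram.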